In the first part of this project, we focused on the performance of individual applications. We
developed an out-of-core parallel sparse Cholesky solver which achieved a maximum application-level
I/O rate of 430 MB/s on our 16-processor IBM SP-2. This year, we have broadened our focus
to  cover  a  wider  variety  of  applications.  Our  goal  was  to  determine  the  requirements  and  techniques 
for  a  large  class  of  applications. 

To  this  end,  we  have  analyzed  a  diverse  suite  of  I/O-intensive  parallel  applications  to  determine 
their  I/O  requirements  and  the  implications  of  these  requirements  for  the  design  of  I/O  systems 
for  parallel  machines.  We  attempted  to  answer  the  following  questions.  First,  what  are  the  steady- 
state  and  peak  I/O  rates  required  by  the  application?  These  rates  indicate  how  aggressive  the  I/O 
system  needs  to  be.  Second,  what  spatial  patterns,  if  any,  exist  in  the  sequence  of  I/O  requests? 
In particular, what are the common request sizes, and are I/O requests sequential? Third,
what  is  the  degree  of  intra-processor  and  inter-processor  locality  in  I/O  accesses?  These  measures 
indicate whether caching previously accessed data is likely to improve performance and, if so, where
the cache(s) should be placed and which caching policies should be used. This information is also
useful  to  understand  the  impact  of  alternative  disk  placements  on  application  performance.  For 
example,  if  all  processors  in  an  application  access  only  their  own  partition  of  the  data,  the  use 
of  private  disks  instead  of  shared  disks  can  reduce  communication  requirements  and  can  increase 
aggregate  I/O  bandwidth.  On  the  other  hand,  if  the  data  in  a  partition  is  written  by  the  owning 
processor  but  read  by  all  processors,  there  may  be  no  significant  performance  penalty  to  using 
shared  disks.  Fourth,  does  the  application  structure  allow  programmers  to  disclose  future  I/O 
requests  to  the  I/O  system?  If  this  is  indeed  the  case,  it  would  provide  an  opportunity  for  the 
I/O  system  to  improve  the  utilization  of  the  storage  devices  as  well  as  reduce  the  latency  of  I/O 
requests  [?,  ?].  Finally,  what  patterns,  if  any,  exist  in  the  sequence  of  inter-arrival  times  for  I/O 
requests?  This  information  would  allow  the  I/O  system  to  estimate  when  I/O  requests  that  have 
been  previously  disclosed  will  actually  be  made.  This  can  allow  it  to  better  schedule  prefetches  for 
the sequence of I/O requests disclosed by individual applications as well as those disclosed by a
group  of  applications. 
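
As an illustration of how the spatial-pattern, locality and inter-arrival questions above could be posed against a request trace, the following is a minimal sketch in Python. The record layout (`Request` and its fields), the function name `summarize` and the 4 KB block size are assumptions made for this example; they are not the format of the AIX or Pablo traces, and the sketch ignores the distinction between files and between reads and writes.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Request(NamedTuple):
    """One traced I/O request (hypothetical record layout)."""
    proc: int     # issuing processor
    time: float   # time the request was issued, in seconds
    op: str       # "read" or "write"
    offset: int   # byte offset within the file
    size: int     # request size in bytes


def summarize(trace, block=4096):
    """Summarize request sizes, sequentiality, sharing and inter-arrival times."""
    sizes = Counter(r.size for r in trace)
    next_offset = {}              # per processor: offset that would continue its last request
    last_time = {}                # per processor: time of its previous request
    gaps = defaultdict(list)      # per processor: inter-arrival times
    touched_by = defaultdict(set) # block number -> processors that accessed it
    sequential = 0

    for r in sorted(trace, key=lambda r: r.time):
        # A request is counted as sequential if it starts where the same
        # processor's previous request ended.
        if next_offset.get(r.proc) == r.offset:
            sequential += 1
        next_offset[r.proc] = r.offset + r.size

        if r.proc in last_time:
            gaps[r.proc].append(r.time - last_time[r.proc])
        last_time[r.proc] = r.time

        first = r.offset // block
        last = (r.offset + r.size - 1) // block
        for b in range(first, last + 1):
            touched_by[b].add(r.proc)

    shared = sum(1 for procs in touched_by.values() if len(procs) > 1)
    return {
        "common_sizes": sizes.most_common(3),
        "sequential_fraction": sequential / len(trace) if trace else 0.0,
        "shared_block_fraction": shared / len(touched_by) if touched_by else 0.0,
        "inter_arrival": dict(gaps),
    }
```

A high `sequential_fraction` suggests that simple read-ahead may help, while a low `shared_block_fraction` is the kind of evidence that would favor private over shared disks.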

To  address  these  questions,  we  have  analyzed  I/O  request  traces  for  a  diverse  set  of  I/O-intensive 
parallel applications. This set includes four non-scientific applications (IBM’s DB2 Parallel Edition,
data mining, a parallel web server and parallel textual search) and three parallel scientific
applications. These applications have been tuned, to various degrees, to achieve good I/O performance.
We believe that this is important, as studying applications which have not been tuned for
I/O  performance  can  lead  to  misleading  conclusions  [?].  We  ran  these  applications  on  an  IBM  SP-2 
and  captured  the  I/O  requests  using  the  trace  facility  provided  by  AIX  4.1.  In  addition,  we  have 
acquired  the  I/O  request  traces  made  available  by  the  Pablo  group  at  the  University  of  Illinois, 
Urbana-Champaign  [?].  These  traces  correspond  to  four  parallel  scientific  applications  from  several 
domains. 


Some  of  our  conclusions  were: 


•  Given  the  high  steady-state  and  peak  demands  for  read  requests,  I/O  systems  should  be 
aggressive in optimizing data retrieval; we expect that simple write-behind policies would be
effective  given  the  low  demand  for  writing  data. 

• For the current hardware, an I/O system that delivers a read bandwidth of about 19 MB/s
per processor should meet the requirements of even the most demanding applications, and
a read bandwidth of 10 MB/s per processor should be adequate for most applications.

• For the variety of applications examined in this study, requests were usually large and the
access patterns were in some cases simpler and in others more complex than nested strides.

•  There  is  little  or  no  write-sharing  between  processors. 

•  Local  disks  are  an  important  component  of  I/O  systems  for  parallel  machines. 

• Different applications require different caching policies: in particular, they need different cache
replacement  policies  and  different  cache  placement  (server/client/both)  decisions.  Ideally,  I/O 
systems  for  parallel  machines  should  allow  the  application  to  control  or  specify  the  caching 
policy. 

•  For  many  applications,  the  sequence  of  inter-arrival  times  between  read  requests  can  be 
described  by  relatively  simple  patterns.  We  have  seen  three  patterns:  constant,  piece-wise 
constant  and  piece-wise  quadratic.  The  repetitve  nature  of  the  patterns  suggests  that  it  might 
be  possible  for  the  I/O  system  to  determine  the  pattern  during  the  first  few  repetitions  and 
to  use  that  information  to  estimate  the  inter- arrival  times  for  the  subsequent  repetitions.  We 
are  planning  to  explore  this  further. 
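
To make the idea in the last item concrete, here is a minimal sketch, in Python, of one way an I/O system might learn a repeating inter-arrival pattern from the first few repetitions and use it to estimate the next inter-arrival time. It is an illustration only, not the method used or planned in this project: the function names `find_period` and `predict_next` and the 5% tolerance are assumptions, and the sketch handles only constant and piece-wise constant patterns (a piece-wise quadratic pattern would require fitting a curve to each piece).

```python
def find_period(gaps, tol=0.05, min_repeats=2):
    """Return the smallest period p such that gaps[i] matches gaps[i - p]
    (within relative tolerance tol) for every i >= p, or None if no period
    is observed at least min_repeats times."""
    n = len(gaps)
    for p in range(1, n // min_repeats + 1):
        if all(
            abs(gaps[i] - gaps[i - p]) <= tol * max(abs(gaps[i - p]), 1e-12)
            for i in range(p, n)
        ):
            return p
    return None


def predict_next(gaps, tol=0.05):
    """Estimate the next inter-arrival time from the detected period,
    or return None if no repeating pattern has been seen yet."""
    p = find_period(gaps, tol)
    if p is None:
        return None
    # The next gap is expected to repeat the one observed a full period earlier.
    return gaps[len(gaps) - p]
```

For a constant pattern the detected period is 1 and the estimate is simply the last observed gap; for a piece-wise constant pattern the period spans one full repetition of the pieces.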


